install.packages("tidyverse", Ncpus = 6)
What is the Tidyverse?
The Tidyverse (https://www.tidyverse.org) is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Packages included in the Tidyverse are:
ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics
dplyr: a grammar of data manipulation
readr: a fast and friendly way to read rectangular data (like csv, tsv, and fwf)
tibble: data.frames that are lazy and surly
tidyr: helps you create tidy data
stringr: a cohesive set of functions designed to make working with strings as easy as possible
forcats: a suite of tools that solve common problems with factors
lubridate: helps with date-time data
purrr: enhances R’s functional programming (FP) toolkit
Installing the Tidyverse package
You can install R packages from several sources:
CRAN (Comprehensive R Archive Network)
GitHub
# devtools must be installed before it can be loaded
library(devtools)
# https://github.com/tidyverse/tidyverse
devtools::install_github("tidyverse/tidyverse")
Source file (tar.gz)
# path_to_file is the full path to the tar.gz file
install.packages(path_to_file, repos = NULL, type = "source")
RStudio (using Tools –> Install Packages)
Loading the packages
We will load the tidyverse and palmerpenguins packages. The palmerpenguins package contains a dataset we will use to explore the many functions within the tidyverse.
library(tidyverse)
library(palmerpenguins) # load penguins data
Palmerpenguins Dataset
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica. Data from 344 penguins were recorded.
You can check out more data exploration and visualization with the palmerpenguins dataset here: palmerpenguins.
Let's explore the penguins dataset further.
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# ℹ 334 more rows
The penguins dataset is stored in a tibble, the tidyverse's variant of a data frame. This tibble contains 344 rows and 8 columns. The abbreviation printed under each column name describes the variable's type:
int: integers
dbl: doubles or real numbers
chr: character or strings
fct: factors
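The same type information can be retrieved programmatically. The following is a minimal sketch, assuming the palmerpenguins package is installed; col_types is only an illustrative name. Note that base R reports dbl columns as "numeric".

```r
library(palmerpenguins)

# A tibble is still a data frame underneath
is.data.frame(penguins)

# First class of every column, returned as a named character vector
col_types <- vapply(penguins, function(col) class(col)[1], character(1))
col_types
```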
Let's examine the column data.
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, …
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torge…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 38.7, 42.5, 34.4, 46.0, 37.8, 37.7, 35.9, 38.2, 38.8, 35.3, 40.6, 40.5, 37.9, 40.5, 39.5, 37.2, 39…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 19.0, 20.7, 18.4, 21.5, 18.3, 18.7, 19.2, 18.1, 17.2, 18.9, 18.6, 17.9, 18.6, 18.9, 16.7, 18.1, 17…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184, 194, 174, 180, 189, 185, 180, 187, 183, 187, 172, 180, 178, 178, 188, 184, 195, 196, 190, 180, 181…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, 3325, 4200, 3400, 3600, 3800, 3950, 3800, 3800, 3550, 3200, 3150, 3950, 3250, 3900, 33…
$ sex <fct> male, female, female, NA, female, male, female, male, NA, NA, NA, NA, female, male, male, female, female, male, female, male, female, male, female, male, male, female, male, female, female, ma…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
The column headers include:
species: Chinstrap, Gentoo, Adelie
island: Biscoe, Dream, Torgersen
year: 2007, 2008, 2009
sex: female, male
flipper_length_mm: flipper length (mm)
bill_length_mm: bill length (mm)
bill_depth_mm: bill depth (mm)
body_mass_g: body mass (g)
Importing Data with readr
The readr package provides several functions for reading a data file, depending on the file type.
read_csv(): comma-separated values (CSV)
read_tsv(): tab-separated values (TSV)
read_csv2(): semicolon-separated values with , as the decimal mark
read_delim(): delimited files (CSV and TSV are important special cases)
read_fwf(): fixed-width files
read_table(): whitespace-separated files
read_log(): web log files
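To see how the delimiter distinguishes these functions without needing any files on disk, the sketch below parses small inline strings. It assumes readr 2.0 or later, where wrapping a string in I() marks it as literal data rather than a path; the two tiny datasets are made up for illustration.

```r
library(readr)

# Hypothetical inline data; I() tells readr this is content, not a file path
csv_text <- "species,bill_length_mm\nAdelie,39.1\nGentoo,46.1\n"
pg_inline_csv <- read_csv(I(csv_text), show_col_types = FALSE)

# The same table with ":" as the delimiter needs read_delim()
colon_text <- "species:bill_length_mm\nAdelie:39.1\nGentoo:46.1\n"
pg_inline_delim <- read_delim(I(colon_text), delim = ":", show_col_types = FALSE)

# Both calls should recover the same columns and values
all.equal(pg_inline_csv$bill_length_mm, pg_inline_delim$bill_length_mm)
```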
Examples of different file types
- Reading in delimited files (e.g. “:”)
pg_delim <- read_delim(file = "data/penguins.txt", delim = ":", col_names = TRUE)
- Reading in CSV files
pg_csv <- read_csv(file = "data/penguins.csv", col_names = TRUE)
- Reading in TSV files
# assumes a tab-separated copy of the data saved as data/penguins.tsv
pg_tsv <- read_tsv(file = "data/penguins.tsv", col_names = TRUE)
- Reading in Excel files
readxl package!
read_excel will determine whether the file is of .xls or .xlsx format. If you know the specific extension, use read_xls or read_xlsx instead.
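To try read_excel() without a spreadsheet of your own, one option is the example workbook bundled with readxl. This is a sketch that assumes readxl is installed; example_path, sheet_names, and first_sheet are illustrative names.

```r
library(readxl)

# Path to an example workbook that ships with the readxl package
example_path <- readxl_example("datasets.xlsx")

# List the worksheets, then read the first one by name
sheet_names <- excel_sheets(example_path)
first_sheet <- read_excel(example_path, sheet = sheet_names[1])

sheet_names
first_sheet
```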
library(readxl)
pg_xls <- read_xlsx(path = "data/penguins.xlsx", sheet = NULL, col_names = TRUE)
- Reading in Google Sheets files
googlesheets4 package!
You might see a message requesting authentication with the googlesheets4 package. Select 1 and follow the authorization process. You only need to do this once.
The googlesheets4 package is requesting access to your Google account.
Enter '1' to start a new auth process or select a pre-authorized account.
1: Send me to the browser for a new auth process.
2: email@ucr.edu
Selection:
library(googlesheets4)
URL <- "https://docs.google.com/spreadsheets/d/1dFh-U1P0PpJurRXpmXbDzFalLpZMsMn7HvjPZ--vznw/edit?usp=sharing"
pg_gsheet <- read_sheet(ss = URL, sheet = NULL, col_names = TRUE)
Data Wrangling/Transformation with dplyr
This section introduces the main functions of the dplyr package for data transformation. There are six key functions in dplyr:
- Pick observations (i.e., rows) by their values (filter())
- Reorder the rows (arrange())
- Pick variables (i.e., columns) by their names (select())
- Create new variables with functions of existing variables (mutate())
- Collapse many values down to a single summary (summarize())
- Apply functions by group (group_by())
We will use these functions to explore the palmerpenguins dataset introduced above.
Filter rows with filter()
filter() allows you to subset observations based on their values.
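filter() keeps only the rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped as well. The sketch below illustrates this on a small made-up tibble (toy is hypothetical, not part of the penguins data).

```r
library(dplyr)

# Hypothetical four-row table with one missing value in sex
toy <- tibble(id = 1:4, sex = c("female", "male", NA, "female"))

# Rows where the test is TRUE
females <- filter(toy, sex == "female")

# The NA row is neither "female" nor "not female", so it is dropped here too
not_female <- filter(toy, sex != "female")

# Ask for the NA rows explicitly to keep them
not_female_or_na <- filter(toy, sex != "female" | is.na(sex))
```

This matters for the penguins data, where several rows have a missing sex or missing measurements.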
# female penguins only
filter(penguins, sex == "female")
# data collected from 2007 or 2008
filter(penguins, year == 2007 | year == 2008)
filter(penguins, year %in% c(2007,2008))
# penguins with bill_length <= 40 and bill_depth < 20 (two equivalent forms)
filter(penguins, !(bill_length_mm > 40 | bill_depth_mm >= 20))
filter(penguins, bill_length_mm <= 40, bill_depth_mm < 20)
# penguins with bill_length > 45 and body_mass > 4000
filter(penguins, bill_length_mm > 45 & body_mass_g > 4000)
# remove rows containing NA in bill length
filter(penguins, !is.na(bill_length_mm))
Exercises
- How many male penguins have bill length > 50?
- How many penguins from the Adelie species were on Biscoe island?
Solutions
# 1. How many male penguins have bill length > 50?
filter(penguins, sex == "male" & bill_length_mm > 50)
# 2. How many penguins from the Adelie species were on Biscoe island?
filter(penguins, species == "Adelie" & island == "Biscoe")
Arrange rows with arrange()
arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
# sort penguins by sex, species, island
arrange(penguins, sex, species, island)
# sort penguins by bill length, in descending order
arrange(penguins, desc(bill_length_mm))
Exercises
- Sort the data by species, then by bill length (in descending order)
- Sort the data by island, then body mass (in descending order), then flipper length
Solutions
# 1. Sort the data by species, then by bill length (in descending order)
arrange(penguins, species, desc(bill_length_mm))
# 2. Sort the data by island, then body mass (in descending order), then flipper length
arrange(penguins, island, desc(body_mass_g), flipper_length_mm)
Select columns with select()
select() allows you to rapidly zoom in on a useful subset of variables based on their names.
# select columns by name (e.g., species, bill length, and body mass)
select(penguins, species, bill_length_mm, body_mass_g)
# select all columns between species and bill depth (inclusive)
select(penguins, species:bill_depth_mm)
# select all columns except those from island to flipper length (inclusive)
select(penguins, -(island:flipper_length_mm))
# select the species column and all columns that begin with "bill"
select(penguins, species, starts_with("bill"))
# select the species column and all columns that end with "_mm"
select(penguins, species, ends_with("_mm"))
# select the species column and all columns with "length"
select(penguins, species, contains("length"))
# rename a variable (e.g. species to genera)
rename(penguins, genera = species)
Exercises
- Select the island, species, and all columns containing “th”
- Select just the columns containing measurements
- Remove the body_mass_g column from the table
Solutions
# 1. Select the island, species, and all columns containing "th"
select(penguins, island, species, contains("th"))
# 2. Select just the columns containing measurements
select(penguins, bill_length_mm:body_mass_g)
select(penguins, -c(species, island, sex, year))
# 3. Remove the body_mass_g column from the table
select(penguins, -body_mass_g)
Add new column variables with mutate()
By default, mutate() adds new columns at the end of your dataset, so we’ll start by creating a narrower dataset in which the new variables are easy to see.
# create a subset of penguins data
penguins_sml <- select(penguins,
-c(island, year)
)
# create variables flipper_length_cm and log10_body_mass_g
mutate(penguins_sml,
flipper_length_cm = flipper_length_mm / 10,
log10_body_mass_g = log10(body_mass_g)
)
# create new variables using other variables
mutate(penguins_sml,
ratio_bill_len_dep_mm = bill_length_mm / bill_depth_mm
)
# only display the new variables
transmute(penguins_sml,
ratio_bill_len_dep_mm = bill_length_mm / bill_depth_mm
)
Exercises
- Create a new variable called index, where index is proportional to flipper length (mm) times the ratio of bill length (mm) to bill depth (mm)
- Create a new variable called bmi, where bmi is the index (from exercise 1) divided by body mass (in kg)
Solutions
# 1. Create a new variable called index, where index is proportional to flipper length (mm) times the ratio of bill length (mm) to bill depth (mm)
mutate(penguins, index = flipper_length_mm * (bill_length_mm / bill_depth_mm))
# 2. Create a new variable called bmi, where bmi is the index (from exercise 1) divided by body mass (in kg)
mutate(penguins,
  index = flipper_length_mm * (bill_length_mm / bill_depth_mm),
  bmi = index / (body_mass_g / 1000))
Grouped summaries with summarise() and group_by()
summarise() collapses a data frame into a single row. Some useful summary functions include:
mean(x): mean
median(x): median
sd(x): standard deviation
n(): count
sum(!is.na(x)): count of non-missing values
n_distinct(x): count of distinct values
group_by() splits the data into groups (based on the values in one or more columns), so that summarise() returns one row per group.
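Summary functions such as mean() return NA as soon as any value in a group is missing, which is why the calls in this section pass na.rm = TRUE. The sketch below shows the difference on a small made-up tibble (toy and its values are hypothetical).

```r
library(dplyr)

# Hypothetical data: group "A" has one missing body mass
toy <- tibble(
  species = c("A", "A", "B"),
  body_mass_g = c(3500, NA, 5000)
)

by_species <- toy |>
  group_by(species) |>
  summarise(
    n_rows = n(),                                # all rows, missing or not
    n_measured = sum(!is.na(body_mass_g)),       # rows with a measurement
    mean_raw = mean(body_mass_g),                # NA if any value is missing
    mean_clean = mean(body_mass_g, na.rm = TRUE) # missing values ignored
  )

by_species
```

Comparing n_rows with n_measured is a quick way to see how many observations a mean is actually based on.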
# mean bill length for all penguins surveyed
summarise(penguins, mean_bill_len = mean(bill_length_mm, na.rm = TRUE))
# mean bill length by species and island
species_island <- group_by(penguins, species, island)
summarise(species_island, mean_bill_len = mean(bill_length_mm, na.rm = TRUE))
# summarize by multiple conditions on grouped data (species, island)
# number penguins, mean bill length, median flipper length, minimum body mass, maximum body mass
species_island <- group_by(penguins, species, island)
summarise(species_island, no_penguins = n(),
mean_bill_len = mean(bill_length_mm, na.rm = TRUE),
median_flipper_len = median(flipper_length_mm, na.rm = TRUE),
min_body_mass_g = min(body_mass_g, na.rm = TRUE),
max_body_mass_g = max(body_mass_g, na.rm = TRUE)
)
Exercises
- Summarize by species, the number of penguins and the average body mass
- Summarize by species and sex, the number of penguins and the average body mass
Solutions
# 1. Summarize by species, the number of penguins and the average body mass
species <- group_by(penguins, species)
summarise(species, no_penguins = n(), mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# 2. Summarize by species and sex, the number of penguins and the average body mass
species_sex <- group_by(penguins, species, sex)
summarise(species_sex, no_penguins = n(), mean_body_mass = mean(body_mass_g, na.rm = TRUE))
Using the pipe operator %>% or |> to link multiple commands
The pipe operator allows us to connect one command to the next without creating intermediate objects.
The %>% pipe comes from the magrittr package, while the native pipe |> has been built into base R since version 4.1. You need to load the magrittr or dplyr package to use the magrittr pipe; the native pipe does not require any packages.
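For simple chains like the ones in this tutorial, the two pipes can be used interchangeably. The sketch below runs the same summary with each and compares the results; it assumes R 4.1 or later plus the dplyr and palmerpenguins packages, and the object names are illustrative.

```r
library(dplyr)          # loading dplyr also makes %>% available
library(palmerpenguins)

# Native pipe (base R >= 4.1)
native_result <- penguins |>
  group_by(species) |>
  summarise(mean_bill_len = mean(bill_length_mm, na.rm = TRUE), .groups = "drop")

# magrittr pipe
magrittr_result <- penguins %>%
  group_by(species) %>%
  summarise(mean_bill_len = mean(bill_length_mm, na.rm = TRUE), .groups = "drop")

# TRUE when both pipelines produce the same table
isTRUE(all.equal(native_result, magrittr_result))
```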
# mean bill length by species and island (without pipe)
species_island <- group_by(penguins, species, island)
summarise(species_island, mean_bill_len = mean(bill_length_mm, na.rm = TRUE))
# mean bill length by species and island (with pipe)
penguins |>
group_by(species, island) |>
summarise(mean_bill_len = mean(bill_length_mm, na.rm = TRUE))
# combine multiple functions with pipe
penguins |>
group_by(species, island) |>
summarise(mean_bill_len = mean(bill_length_mm, na.rm = TRUE)) |>
  filter(species == "Adelie")
Exercises
- Summarize by species, the number of penguins and the average body mass
- Summarize by species and sex, the number of penguins and the average body mass
Solutions
# 1. Summarize by species, the number of penguins and the average body mass
penguins |>
  group_by(species) |>
  summarise(no_penguins = n(),
            mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# 2. Summarize by species and sex, the number of penguins and the average body mass
penguins |>
  group_by(species, sex) |>
  summarise(no_penguins = n(),
            mean_body_mass = mean(body_mass_g, na.rm = TRUE))
Useful Commands
These are a few handy commands that you will likely encounter when wrangling your data.
# removing all rows containing NA in any column
penguins |> na.omit()
# removing rows containing NA in a range of columns (here, bill_length_mm through sex)
filter(penguins, if_all(bill_length_mm:sex, ~ !is.na(.x)))
# selecting and renaming columns in one step using select() (only the named columns are kept)
select(penguins, penguin_type = species, collection_year = year)
# write over previous column data
mutate(penguins, sex = case_when(
sex == "female" ~ "F",
sex == "male" ~ "M",
TRUE ~ NA
))
# convert data type (year from int to char)
mutate(penguins, year = as.character(year))
Exporting files using readr
Similar to the functions used to import data into R, there are corresponding functions to export (i.e. write) data to files, depending on the output file type.
write_delim(): delimited files (CSV and TSV are important special cases)
write_csv(): comma-separated values (CSV)
write_tsv(): tab-separated values (TSV)
write_excel_csv(): CSV files intended for opening in Excel
write_sheet(): Google Sheets (from the googlesheets4 package)
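A quick way to check an export is to write a table out and read it straight back. The sketch below does this with a temporary file so nothing is left in your project directory; toy and out_file are illustrative names, and the .gz extension relies on readr compressing output based on the file name.

```r
library(readr)

# Hypothetical two-row table to export
toy <- data.frame(sample_id = c("S1", "S2"), depth = c(40.5, 32))

# Temporary output path; the .gz ending asks readr to compress the file
out_file <- tempfile(fileext = ".csv.gz")
write_csv(toy, out_file)

# Read the file back and compare column by column
round_trip <- read_csv(out_file, show_col_types = FALSE)
identical(round_trip$sample_id, toy$sample_id)
all.equal(round_trip$depth, toy$depth)
```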
Examples of different file types
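- Writing to a delimited file (e.g. “:”)
write_delim(object_name, file = "data/table.txt", delim = ":", col_names = TRUE)
- Writing to CSV
write_csv(object_name, file = "data/table.csv", col_names = TRUE)
- Writing to TSV
write_tsv(object_name, file = "data/table.tsv", col_names = TRUE)
- Writing to Excel (write_excel_csv() writes a CSV file that Excel opens correctly, not an .xls workbook)
write_excel_csv(object_name, file = "data/table.csv", col_names = TRUE)
- Writing to Google Sheets (write_sheet() comes from the googlesheets4 package; ss identifies the target sheet by URL or ID)
write_sheet(object_name, ss = "googlesheet_url_or_id", sheet = NULL)
- Writing to a compressed file (readr compresses the output when the file name ends in .gz)
write_csv(object_name, file = "data/table.csv.gz", col_names = TRUE)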
Exercise: Wrangling a metadata file
In this section, we will wrangle a metadata file by doing the following:
- Read in the file
- Examine the data structure
- Subset the data to the “Sample_ID”, “Treatment Group”, “Sequencing Depth (M)”, “Technician Name”, and “Sequencer_Platform” columns
- Rename the column headings (replacing spaces and hyphens with underscores)
- For the Treatment Group, replace “High_Dose” with “high”, “Low_Dose” with “low”, and “Control” with “control”, and change the column name to “dosage”
- For the Technician Name, convert “alice” to “Alice” and “BOB” to “Bob”
- Create a total_cost column, where cost is the sequencing depth (in M reads) * $100 per M reads
From the data, address the following questions:
- Summarize the total number of samples processed by each technician per sequencer platform
- How many samples are suitable for further downstream analysis (requires a minimum of 35M reads per sample)?
Solutions
data_url <- "https://raw.githubusercontent.com/bioinformatics-workshop/Intro-to-Tidyverse-2025/refs/heads/main/data/metadata.csv"
metadata <- read_csv(data_url, col_names = TRUE)
metadata
glimpse(metadata)
metadata_subset <- metadata |>
select(Sample_ID, `Treatment Group`, `Sequencing Depth (M)`, `Technician Name`, Sequencer_Platform) |>
rename(dosage = `Treatment Group`,
seq_depth_M = `Sequencing Depth (M)`,
tech_name = `Technician Name`) |>
mutate(dosage = case_when(
dosage == "High_Dose" ~ "high",
dosage == "Low_Dose" ~ "low",
dosage == "Control" ~ "control",
TRUE ~ NA
)) |>
mutate(tech_name = case_when(
tech_name == "alice" ~ "Alice",
tech_name == "BOB" ~ "Bob",
TRUE ~ tech_name
)) |>
mutate(total_cost = seq_depth_M * 100)
# Addressing questions
# Summarize the total number of samples processed by each technician per sequencer platform
metadata_subset |>
group_by(tech_name, Sequencer_Platform) |>
summarise(no_samples = n())
# How many samples are suitable for further downstream analysis (requires a minimum of 35M reads per sample)
metadata_subset |>
  filter(seq_depth_M >= 35)
Additional Resources
Data Science Workshops (Harvard Business School and Institute for Quantitative Social Science)
Session Info
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Rocky Linux 8.10 (Green Obsidian)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.15.so; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics utils datasets grDevices methods base
other attached packages:
[1] palmerpenguins_0.1.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[11] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_1.8.9 compiler_4.4.2 tidyselect_1.2.1 scales_1.3.0 yaml_2.3.10 fastmap_1.2.0 R6_2.5.1 generics_0.1.3 httr2_1.1.2 knitr_1.48 htmlwidgets_1.6.4
[13] ellmer_0.2.1 munsell_0.5.1 tzdb_0.4.0 pillar_1.9.0 rlang_1.1.4 utf8_1.2.4 stringi_1.8.4 xfun_0.49 S7_0.2.0 timechange_0.3.0 cli_3.6.3 withr_3.0.2
[25] magrittr_2.0.3 digest_0.6.37 grid_4.4.2 rstudioapi_0.17.1 hms_1.1.3 rappdirs_0.3.3 lifecycle_1.0.4 coro_1.1.0 vctrs_0.6.5 evaluate_1.0.1 glue_1.8.0 fansi_1.0.6
[37] colorspace_2.1-1 rmarkdown_2.29 tools_4.4.2 pkgconfig_2.0.3 htmltools_0.5.8.1